Conversation

Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the CodeRabbit settings.
Walkthrough

Adds a TypeScript AI search stack: embedding service, vector store with persistence, semantic and BM25 keyword search engines, a hybrid search coordinator with reciprocal rank fusion, shared types and barrel exports, project configs, and extensive unit tests.
Sequence Diagram(s)

sequenceDiagram
actor User
participant Hybrid as HybridSearchService
participant Semantic as SemanticSearchService
participant Keyword as KeywordSearchEngine
participant Embedding as EmbeddingService
participant Store as VectorStore
participant RRF as RRF
User->>Hybrid: search(query, options)
par semantic path
Hybrid->>Semantic: search(query, options)
Semantic->>Embedding: embed(query)
Embedding-->>Semantic: queryVector
Semantic->>Store: search(queryVector, topK, minScore)
Store-->>Semantic: semanticResults
Semantic-->>Hybrid: semanticResults
and keyword path
Hybrid->>Keyword: search(query, topK)
Keyword-->>Hybrid: keywordResults
end
Hybrid->>RRF: reciprocalRankFusion([semanticResults, keywordResults], k)
RRF-->>Hybrid: fusedResults
Hybrid-->>User: topK fusedResults
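For readers unfamiliar with the fusion step in the diagram, here is a minimal TypeScript sketch of reciprocal rank fusion. The `RankedItem` shape and the default `k = 60` are illustrative assumptions, not the project's actual types:

```typescript
// Minimal sketch of reciprocal rank fusion (RRF). Each ranked list
// contributes 1 / (k + rank) per item; items appearing in multiple
// lists accumulate contributions and therefore rank higher.
interface RankedItem {
  id: string;
  score: number; // engine-native score (e.g., cosine or BM25)
}

function reciprocalRankFusion(
  lists: RankedItem[][],
  k = 60, // common RRF damping constant
): { id: string; fusionScore: number }[] {
  const fused = new Map<string, number>();
  for (const list of lists) {
    list.forEach((item, rank) => {
      // rank is 0-based here, so rank + 1 gives the 1-based position
      fused.set(item.id, (fused.get(item.id) ?? 0) + 1 / (k + rank + 1));
    });
  }
  return [...fused.entries()]
    .map(([id, fusionScore]) => ({ id, fusionScore }))
    .sort((a, b) => b.fusionScore - a.fusionScore);
}
```

A chunk matched by both the semantic and the keyword path receives two contributions, which is why it tends to outrank a chunk matched by only one engine.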
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 1 passed | ❌ 1 failed (inconclusive)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 10
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/ai/__tests__/hybridSearch.test.ts`:
- Around line 42-51: The test helper buildService creates a VectorStore without
a modelId, so index compatibility won't be checked; update buildService to pass
a modelId (e.g., use the EmbeddingService's modelId) into the VectorStore config
when constructing it: after creating embeddingService (EmbeddingService),
include modelId: embeddingService.modelId in the new VectorStore({...}) options
so VectorStore performs the same model compatibility/invalidation behavior as in
production before returning the HybridSearchService.
In `@src/ai/__tests__/vectorStore.test.ts`:
- Around line 183-195: The test "skips write when nothing changed" currently
only checks file contents and can pass even if the file was rewritten; change it
to assert that the file was not rewritten by importing stat from
node:fs/promises and using stat(join(tempDir, "test-index.json")) to capture
file metadata (e.g., mtimeMs or size) before the second await store.save(), call
store.save() again, then stat the file again and expect the metadata to be
unchanged; alternatively, spy on the write path used by the VectorStore
persistence (the function that writes the file) to assert it was not called on
the second save. Also add the stat import to the top-level imports.
In `@src/ai/embeddings.ts`:
- Around line 88-91: The embedSingle method currently returns results[0] without
verifying the array is non-empty; update embedSingle to defensively validate the
output of embed (call to this.embed([text])) — check that results is an array
with at least one element and that results[0] is defined, and if not throw a
clear error (or handle per project error conventions) so callers won't receive
undefined; modify the embedSingle function to perform this guard and surface a
descriptive error referencing embedSingle/embed.
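The guard this comment asks for could look roughly like the following. Here `embed` is passed in as a stand-in for `EmbeddingService.embed`, so the signature is an assumption rather than the project's real API:

```typescript
// Hedged sketch of a defensive embedSingle: validate the output of the
// batch embed call before returning, so callers never receive undefined.
async function embedSingle(
  text: string,
  embed: (texts: string[]) => Promise<number[][]>,
): Promise<number[]> {
  const results = await embed([text]);
  if (!Array.isArray(results) || results.length === 0 || results[0] === undefined) {
    // Surface a descriptive error instead of silently returning undefined
    throw new Error(
      "embedSingle: embed([text]) returned no vectors; expected exactly one",
    );
  }
  return results[0];
}
```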
In `@src/ai/hybridSearch.ts`:
- Around line 7-29: reciprocalRankFusion is currently writing RRF values into
SearchResult.score which breaks the documented semantic (cosine) score; update
the implementation to preserve the original cosine similarity and expose the
fusion ranking in a separate field or type (e.g., create a new
HybridSearchResult type with fields { chunk: TextChunk; score: number;
fusionScore: number } or add fusionScore to the returned objects), set
fusionScore to the computed rrfScore while keeping score as the original
item.score, and do the same change in HybridSearchService.search (the block
referenced by lines 65-94) so callers receive the original similarity in score
and the fusion metric in fusionScore or the new hybrid result type.
- Around line 37-56: Currently indexNote calls semanticSearch.indexNote(noteId,
content) which re-chunks internally while the hybrid layer separately calls
this.chunkFn(noteId, content) for keyword indexing; instead generate chunks once
(const chunks = this.chunkFn(noteId, content)) and pass that same chunks array
to both engines (e.g., change semanticSearch.indexNote to accept chunks or add a
new method like semanticSearch.indexChunks(noteId, chunks)) then call
semanticSearch.indexChunks(noteId, chunks) and keywordEngine.indexChunks(chunks)
after removing by noteId so both engines index identical chunk objects.
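A sketch of the "chunk once, feed both engines" shape suggested above; the `ChunkFn` and `indexChunks` names mirror the review text but are hypothetical:

```typescript
// Chunk the note exactly once and hand the identical array to every
// engine, so semantic and keyword indexes can never diverge per chunk ID.
interface TextChunk {
  id: string;
  noteId: string;
  content: string;
}

type ChunkFn = (noteId: string, content: string) => TextChunk[];

interface ChunkIndexer {
  indexChunks(chunks: TextChunk[]): void;
}

function indexNote(
  noteId: string,
  content: string,
  chunkFn: ChunkFn,
  engines: ChunkIndexer[],
): TextChunk[] {
  const chunks = chunkFn(noteId, content); // single source of truth
  for (const engine of engines) engine.indexChunks(chunks); // identical objects
  return chunks;
}
```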
In `@src/ai/semanticSearch.ts`:
- Around line 42-46: The initialize() sequence can leave embeddingService
initialized if vectorStore.load() throws; wrap the vectorStore.load() call in a
try/catch and on failure perform a deterministic rollback of the embedding
service (call the appropriate cleanup/shutdown/dispose method on
embeddingService, e.g., embeddingService.shutdown() or
embeddingService.dispose()) before rethrowing the error so state remains
consistent; keep the calls inside the existing initialize() method and preserve
the original error propagation.
- Around line 84-108: The hardcoded minScore 0.4 inside findRelatedNotes (used
in vectorStore.findRelated) should be extracted and documented: add a
configurable parameter or class-level constant (e.g., a new optional parameter
minScore: number = 0.4 on findRelatedNotes or a private readonly
DEFAULT_MIN_SCORE = 0.4 on the containing class) and replace the literal 0.4
with that identifier; update the method signature and call sites accordingly and
add a short comment explaining the threshold purpose and default.
In `@src/ai/vectorStore.ts`:
- Around line 123-124: Wrap the JSON.parse call that converts the file content
into a SerializedVectorStore in a try/catch to handle SyntaxError from
malformed/corrupted index files: around the lines using readFile(filePath,
"utf-8") and JSON.parse(raw) catch parsing errors, log or include the original
error details and throw a new, user-friendly error indicating the index file at
filePath is invalid and should be rebuilt (preserve the original error as cause
or append it to the message) so callers of the vector store load routine can
surface a clear rebuild instruction.
- Around line 151-157: The loaded index JSON needs structural validation before
populating this.entries: verify that the top-level data is an object and
data.entries is an array, then iterate and check each entry has entry.chunk &&
entry.chunk.id (string/number) and entry.vector (array/typed values) before
calling this.entries.set; for any invalid/missing fields either skip the entry
with a descriptive processLogger.warn or throw a clear Error that includes which
entry/index failed, so the loader (the method that currently clears this.entries
and loops over data.entries) fails with an actionable message instead of a
cryptic runtime error.
In `@vitest.config.ts`:
- Around line 3-14: Add explicit Vitest/Vite ESM handling for the
`@huggingface/transformers` package so tests that exercise real initialize() paths
(e.g., in embeddings.ts where initialize() may dynamic-import the library or
when embedFn is not provided) don't break under Node ESM resolution; update the
vitest.config.ts test config to include the package in server.deps.inline or
ssr.noExternal (or both) to force inlining/noExternal for
"@huggingface/transformers" so the module resolves correctly during tests and
runtime.
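One possible shape for the suggested config change, written as a plain object so it stands alone; in the real vitest.config.ts this object would be passed to defineConfig from "vitest/config", and the `server.deps.inline` option name follows recent Vitest versions (treat it as a sketch, not the definitive spelling for every release):

```typescript
// Sketch: force-inline the ESM-only package so Node module resolution
// works when tests exercise the real dynamic import path.
const testConfig = {
  test: {
    globals: true,
    environment: "node",
    include: ["src/**/__tests__/**/*.test.ts"],
    server: {
      deps: {
        // Inline so Vitest processes the package instead of externalizing it
        inline: ["@huggingface/transformers"],
      },
    },
  },
};
```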
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 399f4a0c-1dda-4f24-aa00-d754cf51318e
⛔ Files ignored due to path filters (1)
package-lock.json is excluded by !**/package-lock.json
📒 Files selected for processing (17)
.gitignore, package.json, src/ai/__tests__/embeddings.test.ts, src/ai/__tests__/hybridSearch.test.ts, src/ai/__tests__/keywordSearch.test.ts, src/ai/__tests__/semanticSearch.test.ts, src/ai/__tests__/vectorStore.test.ts, src/ai/embeddings.ts, src/ai/hybridSearch.ts, src/ai/index.ts, src/ai/keywordSearch.ts, src/ai/semanticSearch.ts, src/ai/types.ts, src/ai/vectorStore.ts, tsconfig.build.json, tsconfig.json, vitest.config.ts
function buildService(embedFn?: EmbedFn) {
  const embeddingService = new EmbeddingService({}, embedFn ?? createMockEmbedFn());
  const vectorStore = new VectorStore({
    persistDir: "/tmp/hybrid-test",
    indexFilename: "test-index.json",
  });
  const semantic = new SemanticSearchService(embeddingService, vectorStore);
  const keyword = new KeywordSearchEngine();
  return new HybridSearchService(semantic, keyword);
}
🧹 Nitpick | 🔵 Trivial
Test buildService omits modelId from VectorStore config.
The VectorStore is created without modelId, which means it won't perform model compatibility checks during load. While acceptable for isolated tests, this deviates from production behavior where modelId enables index invalidation on model changes.
Consider adding modelId to align test configuration with production:
  const vectorStore = new VectorStore({
    persistDir: "/tmp/hybrid-test",
    indexFilename: "test-index.json",
+   modelId: "mock-model",
  });

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
function buildService(embedFn?: EmbedFn) {
  const embeddingService = new EmbeddingService({}, embedFn ?? createMockEmbedFn());
  const vectorStore = new VectorStore({
    persistDir: "/tmp/hybrid-test",
    indexFilename: "test-index.json",
    modelId: "mock-model",
  });
  const semantic = new SemanticSearchService(embeddingService, vectorStore);
  const keyword = new KeywordSearchEngine();
  return new HybridSearchService(semantic, keyword);
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/ai/__tests__/hybridSearch.test.ts` around lines 42 - 51, The test helper
buildService creates a VectorStore without a modelId, so index compatibility
won't be checked; update buildService to pass a modelId (e.g., use the
EmbeddingService's modelId) into the VectorStore config when constructing it:
after creating embeddingService (EmbeddingService), include modelId:
embeddingService.modelId in the new VectorStore({...}) options so VectorStore
performs the same model compatibility/invalidation behavior as in production
before returning the HybridSearchService.
/** Load embedding model + persisted index. Call before anything else. */
async initialize(): Promise<void> {
  await this.embeddingService.initialize();
  await this.vectorStore.load();
}
Potential inconsistent state if vectorStore.load() throws.
If vectorStore.load() throws (e.g., on model mismatch per VectorStore.load() at lines 145-148), embeddingService remains initialized but the vector store is not loaded. This creates asymmetric state and the error isn't caught for recovery.
Consider wrapping the initialization sequence or documenting the expected caller behavior:
🛡️ Proposed fix to handle vectorStore.load() errors

  async initialize(): Promise<void> {
    await this.embeddingService.initialize();
-   await this.vectorStore.load();
+   try {
+     await this.vectorStore.load();
+   } catch (error) {
+     // Log and rethrow - caller should handle index rebuild
+     console.error("VectorStore load failed:", error);
+     throw error;
+   }
  }

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/ai/semanticSearch.ts` around lines 42 - 46, The initialize() sequence can
leave embeddingService initialized if vectorStore.load() throws; wrap the
vectorStore.load() call in a try/catch and on failure perform a deterministic
rollback of the embedding service (call the appropriate cleanup/shutdown/dispose
method on embeddingService, e.g., embeddingService.shutdown() or
embeddingService.dispose()) before rethrowing the error so state remains
consistent; keep the calls inside the existing initialize() method and preserve
the original error propagation.
this.entries.clear();
for (const entry of data.entries) {
  this.entries.set(entry.chunk.id, {
    chunk: entry.chunk,
    vector: entry.vector,
  });
}
🧹 Nitpick | 🔵 Trivial
Consider adding structural validation for loaded data.
If the index file contains valid JSON but an unexpected structure (e.g., entries is not an array, or entries lack required chunk.id properties), the code will fail with a cryptic error. A brief structural check would improve debuggability.
♻️ Optional defensive validation

+ if (!Array.isArray(data.entries)) {
+   throw new Error(
+     `VectorStore: Invalid index structure. Delete "${filePath}" to rebuild.`
+   );
+ }
+
  this.entries.clear();
  for (const entry of data.entries) {
    this.entries.set(entry.chunk.id, {

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
if (!Array.isArray(data.entries)) {
  throw new Error(
    `VectorStore: Invalid index structure. Delete "${filePath}" to rebuild.`
  );
}

this.entries.clear();
for (const entry of data.entries) {
  this.entries.set(entry.chunk.id, {
    chunk: entry.chunk,
    vector: entry.vector,
  });
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/ai/vectorStore.ts` around lines 151 - 157, The loaded index JSON needs
structural validation before populating this.entries: verify that the top-level
data is an object and data.entries is an array, then iterate and check each
entry has entry.chunk && entry.chunk.id (string/number) and entry.vector
(array/typed values) before calling this.entries.set; for any invalid/missing
fields either skip the entry with a descriptive processLogger.warn or throw a
clear Error that includes which entry/index failed, so the loader (the method
that currently clears this.entries and loops over data.entries) fails with an
actionable message instead of a cryptic runtime error.
export default defineConfig({
  test: {
    globals: true,
    environment: "node",
    include: ["src/**/__tests__/**/*.test.ts"],
    coverage: {
      provider: "v8",
      include: ["src/ai/**/*.ts"],
      exclude: ["src/ai/__tests__/**", "src/ai/index.ts"],
    },
  },
});
🧹 Nitpick | 🔵 Trivial
🧩 Analysis chain
🏁 Script executed:
# Check package.json for `@huggingface/transformers` dependency
grep -i "huggingface\|transformers" package.json

Repository: AOSSIE-Org/SmartNotes
Length of output: 106
🏁 Script executed:
# Search for any imports of `@huggingface/transformers` in the codebase
rg "@huggingface/transformers|from.*transformers" --type ts --type tsx --type js --type json

Repository: AOSSIE-Org/SmartNotes
Length of output: 92
🏁 Script executed:
# Read the full vitest.config.ts to see current configuration
cat -n vitest.config.ts

Repository: AOSSIE-Org/SmartNotes
Length of output: 531
🏁 Script executed:
# Check if there are any embeddings-related files or tests
fd -e ts -e tsx | grep -i embed

Repository: AOSSIE-Org/SmartNotes
Length of output: 121
🏁 Script executed:
# Check for any existing ESM/SSR configuration in vitest config
rg "ssr|inline|external|deps" vitest.config.ts

Repository: AOSSIE-Org/SmartNotes
Length of output: 47
🏁 Script executed:
# Check embeddings.ts imports
cat -n src/ai/embeddings.ts

Repository: AOSSIE-Org/SmartNotes
Length of output: 4511
🏁 Script executed:
# Check embeddings.test.ts content
cat -n src/ai/__tests__/embeddings.test.ts

Repository: AOSSIE-Org/SmartNotes
Length of output: 6516
🏁 Script executed:
# Search for all references to `@huggingface` in the codebase
rg "@huggingface" --type ts --type js

Repository: AOSSIE-Org/SmartNotes
Length of output: 249
Consider adding explicit Vitest configuration for @huggingface/transformers ESM handling.
While @huggingface/transformers (^3.4.0) is a declared dependency and embeddings.ts uses dynamic import with lazy loading, the current Vitest config lacks ESM/external package handling. The test suite currently avoids triggering the real import by injecting mock implementations, so tests pass. However, if code paths exercise the actual initialize() method without a custom embedFn, the Node ESM environment may require explicit server.deps.inline or ssr.noExternal configuration to prevent module resolution issues. This is a precautionary measure to ensure compatibility when real-world code paths are exercised.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@vitest.config.ts` around lines 3 - 14, Add explicit Vitest/Vite ESM handling
for the `@huggingface/transformers` package so tests that exercise real
initialize() paths (e.g., in embeddings.ts where initialize() may dynamic-import
the library or when embedFn is not provided) don't break under Node ESM
resolution; update the vitest.config.ts test config to include the package in
server.deps.inline or ssr.noExternal (or both) to force inlining/noExternal for
"@huggingface/transformers" so the module resolves correctly during tests and
runtime.
Actionable comments posted: 3
♻️ Duplicate comments (1)
src/ai/hybridSearch.ts (1)
55-60: ⚠️ Potential issue | 🟠 Major
Index both engines from the same chunk array.

SemanticSearchService.indexNote() already re-chunks internally (src/ai/semanticSearch.ts:42-62), and this method chunks the raw content again for BM25. If those chunkers ever diverge, the same chunk ID can point to different text in each engine and RRF will fuse unrelated hits. Generate the chunks once here and feed that exact array to both indexes.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/ai/hybridSearch.ts` around lines 55 - 60, The code currently chunks content twice causing mismatched chunk IDs; call this.chunkFn(noteId, content) once to produce a single chunks array and feed that same array to both engines. Update the call site in hybridSearch.indexNote to use the single chunks array and change SemanticSearchService.indexNote (or add a new semanticSearch.indexChunks) to accept the precomputed chunks array instead of raw content so it does not re-chunk internally; then pass chunks to keywordEngine.indexChunks(chunks) and to the semantic search method, and remove the internal re-chunking in SemanticSearchService.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/ai/__tests__/hybridSearch.test.ts`:
- Around line 42-47: buildService currently hardcodes VectorStore to
"/tmp/hybrid-test", making tests non-hermetic; change buildService to create a
unique temp directory per invocation (e.g., using fs.mkdtempSync or a UUID
appended to os.tmpdir()) and pass that path into the new VectorStore({
persistDir, indexFilename }) so each test gets its own directory, and ensure the
test suite's beforeEach()/afterEach() (or initialize()/teardown) removes or
cleans that temp dir after the test to avoid cross-test contamination.
In `@src/ai/hybridSearch.ts`:
- Around line 51-53: The initialize() method restores only the semantic store
(semanticSearch) but does not rebuild or restore the BM25 keyword index
(keywordEngine), so keywordEngine is empty after restart; update initialize() to
load persisted chunks/documents (the same source semanticSearch uses) and
rebuild or re-index keywordEngine from those persisted chunks, and update save()
to persist either the keywordEngine state or the chunks/documents used to
rebuild it (so keywordEngine can be reconstructed on startup); reference the
initialize() and save() methods, keywordEngine, semanticSearch and the persisted
chunks/documents used by the vector store when implementing this.
- Around line 76-87: The keyword results are being truncated by
KeywordSearchEngine.search before you apply the noteId filter, causing
note-scoped searches to miss matches; fix by applying noteId filtering inside
the keyword search path: either extend KeywordSearchEngine.search to accept an
optional noteId (and filter before slice) and call keywordEngine.search(query,
broadK, opts.noteId) here, or request the full/un-truncated set (e.g., topK =
Infinity or a large number) so you can filter by opts.noteId in this module
before slicing to broadK; update references to keywordEngine.search and ensure
filteredKeyword is produced from the pre-sliced results.
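The filter-before-slice ordering described above can be sketched as follows; `KeywordHit` and `scopedKeywordSearch` are illustrative names, with `allHits` standing in for an unsliced BM25 ranking:

```typescript
// Apply the noteId scope to the FULL ranking, then truncate. Slicing
// first can drop in-scope matches whenever other notes fill the top slots.
interface KeywordHit {
  chunkId: string;
  noteId: string;
  score: number;
}

function scopedKeywordSearch(
  allHits: KeywordHit[], // full BM25 ranking, not yet sliced
  broadK: number,
  noteId?: string,
): KeywordHit[] {
  const scoped = noteId ? allHits.filter((h) => h.noteId === noteId) : allHits;
  return scoped.slice(0, broadK); // truncate only after scoping
}
```

With broadK = 2 and a note whose only hit sits at rank 3 overall, slicing first would discard it, while filtering first keeps it.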
---
Duplicate comments:
In `@src/ai/hybridSearch.ts`:
- Around line 55-60: The code currently chunks content twice causing mismatched
chunk IDs; call this.chunkFn(noteId, content) once to produce a single chunks
array and feed that same array to both engines. Update the call site in
hybridSearch.indexNote to use the single chunks array and change
SemanticSearchService.indexNote (or add a new semanticSearch.indexChunks) to
accept the precomputed chunks array instead of raw content so it does not
re-chunk internally; then pass chunks to keywordEngine.indexChunks(chunks) and
to the semantic search method, and remove the internal re-chunking in
SemanticSearchService.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 05c80561-5fea-473f-a2bf-67aaf9d9ebff
📒 Files selected for processing (4)
src/ai/__tests__/hybridSearch.test.tssrc/ai/hybridSearch.tssrc/ai/index.tssrc/ai/types.ts
♻️ Duplicate comments (2)
src/ai/hybridSearch.ts (2)
72-83: ⚠️ Potential issue | 🟠 Major
Filter keyword hits before truncating to topK.

KeywordSearchEngine.search() slices before returning, so filtering that reduced list at Line 81 can drop valid note-scoped matches whenever other notes occupy the first broadK BM25 slots.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/ai/hybridSearch.ts` around lines 72 - 83, Keyword search results are pre-sliced by KeywordSearchEngine.search, so filtering by opts.noteId after that (into filteredKeyword) can drop valid note-scoped hits; change the flow to fetch untruncated keyword hits (e.g., expose/use a searchAll / unsliced variant or call keywordEngine.search with a sufficiently large K) then apply the noteId filter (opts.noteId) to that full set and only after filtering truncate to broadK (used alongside semanticResults); update usages around keywordEngine.search, filteredKeyword, broadK, and opts.noteId accordingly.
47-49: ⚠️ Potential issue | 🟠 Major
Persist or rebuild the BM25 index on startup.

initialize() restores only the semantic store, and save() persists only that same state. After a restart, keywordEngine is empty, so hybrid search silently degrades to semantic-only until every note is reindexed.

Also applies to: 96-98
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/ai/hybridSearch.ts` around lines 47 - 49, initialize() currently only restores the semantic store so keywordEngine/BM25 is empty after restart; update initialize() (and the save() path referenced near lines 96-98) to persist and restore the BM25/keywordEngine state or trigger a rebuild: add serialization of the BM25 index when save() runs and load/deserializer logic in initialize() to repopulate keywordEngine, and if no serialized snapshot exists call the existing reindex routine (e.g., reindexAll or the method that iterates notes) to rebuild the BM25 index on startup; ensure you reference and update keywordEngine, initialize(), save(), and the BM25 serialization/rebuild helpers so hybridSearch is fully restored after restart.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@src/ai/hybridSearch.ts`:
- Around line 72-83: Keyword search results are pre-sliced by
KeywordSearchEngine.search, so filtering by opts.noteId after that (into
filteredKeyword) can drop valid note-scoped hits; change the flow to fetch
untruncated keyword hits (e.g., expose/use a searchAll / unsliced variant or
call keywordEngine.search with a sufficiently large K) then apply the noteId
filter (opts.noteId) to that full set and only after filtering truncate to
broadK (used alongside semanticResults); update usages around
keywordEngine.search, filteredKeyword, broadK, and opts.noteId accordingly.
- Around line 47-49: initialize() currently only restores the semantic store so
keywordEngine/BM25 is empty after restart; update initialize() (and the save()
path referenced near lines 96-98) to persist and restore the BM25/keywordEngine
state or trigger a rebuild: add serialization of the BM25 index when save() runs
and load/deserializer logic in initialize() to repopulate keywordEngine, and if
no serialized snapshot exists call the existing reindex routine (e.g.,
reindexAll or the method that iterates notes) to rebuild the BM25 index on
startup; ensure you reference and update keywordEngine, initialize(), save(),
and the BM25 serialization/rebuild helpers so hybridSearch is fully restored
after restart.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: bb239b31-74cd-431f-891e-2c3a8539b8d7
📒 Files selected for processing (2)
src/ai/hybridSearch.tssrc/ai/semanticSearch.ts
Actionable comments posted: 1
♻️ Duplicate comments (2)
src/ai/vectorStore.ts (1)
149-165: ⚠️ Potential issue | 🟠 Major
Validate every loaded row before rebuilding the map.

This path only proves that the JSON parses and that the first entry matches vectorDim. Valid JSON with the wrong shape can still fail with opaque TypeErrors at Line 149 or Line 161, and a later mixed-dimension row slips through until search time. Validate entries up front and check each row's chunk fields plus vector length before populating this.entries.

Proposed fix

+ if (!Array.isArray(data.entries)) {
+   throw new Error(
+     `VectorStore: invalid index structure in "${filePath}". ` +
+       `Delete the file to rebuild the index.`,
+   );
+ }
+
- if (data.vectorDim !== undefined && data.entries.length > 0) {
-   const actualDim = data.entries[0].vector.length;
-   if (actualDim !== data.vectorDim) {
-     throw new Error(
-       `VectorStore: index vectorDim mismatch ` +
-         `(stored ${data.vectorDim}, found ${actualDim}). Index may be corrupted.`,
-     );
-   }
- }
  this.entries.clear();
- for (const entry of data.entries) {
+ for (const [index, entry] of data.entries.entries()) {
+   if (
+     !entry ||
+     !entry.chunk ||
+     typeof entry.chunk.id !== "string" ||
+     typeof entry.chunk.noteId !== "string" ||
+     typeof entry.chunk.content !== "string" ||
+     typeof entry.chunk.chunkIndex !== "number" ||
+     !Array.isArray(entry.vector)
+   ) {
+     throw new Error(
+       `VectorStore: invalid entry at index ${index} in "${filePath}". ` +
+         `Delete the file to rebuild the index.`,
+     );
+   }
+   if (
+     data.vectorDim !== undefined &&
+     entry.vector.length !== data.vectorDim
+   ) {
+     throw new Error(
+       `VectorStore: invalid vector dimension at entry ${index} in "${filePath}".`,
+     );
+   }
    this.entries.set(entry.chunk.id, {
      chunk: entry.chunk,
      vector: entry.vector,
    });
  }

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/ai/vectorStore.ts` around lines 149 - 165, Validate the loaded payload before rebuilding this.entries: confirm data.entries is an array and iterate each entry to ensure it has a chunk object with required fields (at least chunk.id) and a numeric vector array whose length matches data.vectorDim (when data.vectorDim is defined) and all elements are numbers; throw a clear Error identifying the offending entry if any check fails; only after all entries pass validation, clear this.entries and populate it using entry.chunk.id -> { chunk: entry.chunk, vector: entry.vector } so malformed rows never get inserted (referencing data.entries, data.vectorDim, entry.chunk, entry.vector, and this.entries).

src/ai/__tests__/vectorStore.test.ts (1)
188-199: ⚠️ Potential issue | 🟡 Minor
Assert the no-op, not just the final contents.

Line 191 says the second save() should be a no-op, but this test still passes if save() rewrites the same JSON. Capture file metadata before/after the second call, or spy on the write path, so the test proves the optimization instead of only re-reading the final file contents.

Proposed fix

-import { mkdtemp, rm, readFile, writeFile } from "node:fs/promises";
+import { mkdtemp, rm, readFile, writeFile, stat } from "node:fs/promises";
...
  it("skips write when nothing changed", async () => {
    store.add([makeChunk("n1", 0, [1, 0, 0])]);
    await store.save();
+   const before = await stat(join(tempDir, "test-index.json"));
    await store.save(); // second save should be no-op
+   const after = await stat(join(tempDir, "test-index.json"));
    const content = await readFile(
      join(tempDir, "test-index.json"),
      "utf-8",
    );
    const data = JSON.parse(content);
+   expect(after.mtimeMs).toBe(before.mtimeMs);
    expect(data.version).toBe(1);
    expect(data.entries).toHaveLength(1);
  });

As per coding guidelines, **/*.test.{ts,tsx,js,jsx}: The tests are not tautological.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/ai/__tests__/vectorStore.test.ts` around lines 188 - 199, The test currently only verifies final file contents but must assert the second store.save() is a true no-op; modify the "skips write when nothing changed" test to capture file metadata or spy the write path before the second save (e.g., stat the file mtime or spy on fs.writeFile/Store's internal write method used by save()), call await store.save() the second time, then assert that no write occurred (mtime unchanged or spy not called) in addition to the existing content checks; locate the logic around store.save() and the file read (readFile/join usage) to add the stat/spy and the assertion.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/ai/vectorStore.ts`:
- Around line 28-33: The add(embeddedChunks: EmbeddedChunk[]) method currently
mutates this.entries before validating embedding sizes, allowing mixed-dimension
vectors to be stored; update add to first determine the expected vector length
(use an existing this.dimension property or set it from the first inserted
ec.embedding.length if missing) and validate every ec.embedding.length against
that expected dimension, throwing an error and refusing to mutate this.entries
if any mismatch is found; ensure you set this.dimension on the first successful
insert and only mark this.dirty = true after all entries have been validated and
inserted.
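The validate-before-mutate rule described above can be sketched as follows. This is a minimal illustration, assuming an `EmbeddedChunk` shape of `{ chunk: { id }, embedding: number[] }`; the class name and the reduced type here are hypothetical stand-ins for the project's real `VectorStore` and chunk types:

```typescript
// Minimal sketch: validate every embedding's dimension BEFORE touching
// the entries map, so a bad batch leaves the store untouched.
interface EmbeddedChunk {
  chunk: { id: string };
  embedding: number[];
}

class DimensionCheckedStore {
  private entries = new Map<string, EmbeddedChunk>();
  private dimension?: number;
  private dirty = false;

  add(embeddedChunks: EmbeddedChunk[]): void {
    // Expected dimension comes from prior inserts, or from the
    // first chunk of this batch if the store is still empty.
    const expected = this.dimension ?? embeddedChunks[0]?.embedding.length;

    // Pass 1: validate everything; throw without mutating on mismatch.
    for (const ec of embeddedChunks) {
      if (ec.embedding.length !== expected) {
        throw new Error(
          `VectorStore: expected ${expected}-dim embedding, ` +
            `got ${ec.embedding.length} for chunk "${ec.chunk.id}"`,
        );
      }
    }

    // Pass 2: all valid, now mutate and mark dirty.
    for (const ec of embeddedChunks) {
      this.entries.set(ec.chunk.id, ec);
    }
    this.dimension = expected;
    this.dirty = true;
  }

  size(): number {
    return this.entries.size;
  }
}
```

Because the first `entries.set` only runs after the whole batch has been checked, a batch containing one mixed-dimension vector leaves the store exactly as it was.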
---
Duplicate comments:
In `@src/ai/__tests__/vectorStore.test.ts`:
- Around line 188-199: The test currently only verifies final file contents but
must assert the second store.save() is a true no-op; modify the "skips write
when nothing changed" test to capture file metadata or spy the write path before
the second save (e.g., stat the file mtime or spy on fs.writeFile/Store's
internal write method used by save()), call await store.save() the second time,
then assert that no write occurred (mtime unchanged or spy not called) in
addition to the existing content checks; locate the logic around store.save()
and the file read (readFile/join usage) to add the stat/spy and the assertion.
In `@src/ai/vectorStore.ts`:
- Around line 149-165: Validate the loaded payload before rebuilding
this.entries: confirm data.entries is an array and iterate each entry to ensure
it has a chunk object with required fields (at least chunk.id) and a numeric
vector array whose length matches data.vectorDim (when data.vectorDim is
defined) and all elements are numbers; throw a clear Error identifying the
offending entry if any check fails; only after all entries pass validation,
clear this.entries and populate it using entry.chunk.id -> { chunk: entry.chunk,
vector: entry.vector } so malformed rows never get inserted (referencing
data.entries, data.vectorDim, entry.chunk, entry.vector, and this.entries).
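As a runnable illustration of that validation order, the check can be factored into a standalone function. The persisted shape and field names below (`vectorDim`, `entries`, `chunk`, `vector`) follow the comment above but are assumptions, not the project's actual file format:

```typescript
// Sketch: validate a loaded index payload in full before any of it is
// allowed into the in-memory map.
interface PersistedEntry {
  chunk: { id: string };
  vector: number[];
}
interface PersistedIndex {
  vectorDim?: number;
  entries: PersistedEntry[];
}

function validateIndex(data: unknown): PersistedIndex {
  const d = data as PersistedIndex;
  if (!d || !Array.isArray(d.entries)) {
    throw new Error("VectorStore: invalid index structure");
  }
  // Use the stored dimension, or infer it from the first entry.
  const expectedDim = d.vectorDim ?? d.entries[0]?.vector?.length;
  d.entries.forEach((entry, i) => {
    if (
      !entry ||
      !entry.chunk ||
      typeof entry.chunk.id !== "string" ||
      !Array.isArray(entry.vector) ||
      entry.vector.some((v) => typeof v !== "number")
    ) {
      throw new Error(`VectorStore: invalid entry at index ${i}`);
    }
    if (expectedDim !== undefined && entry.vector.length !== expectedDim) {
      throw new Error(
        `VectorStore: expected ${expectedDim}-dim vector, ` +
          `found ${entry.vector.length} at entry ${i}`,
      );
    }
  });
  return d;
}
```

Only after `validateIndex` returns would the caller clear and repopulate its entries map, so a corrupted file can never half-populate the index.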
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: f46da6c2-eb49-4565-8c70-ea5406ae89a6
📒 Files selected for processing (2)
src/ai/__tests__/vectorStore.test.ts
src/ai/vectorStore.ts
@dorodb-web22 really solid first implementation: BM25 + RRF + local embeddings in pure TypeScript with 77 tests is impressive work. A few things I noticed:
Blocking concerns:
The per-PR fixes you have been pushing (chunker reuse, error descriptiveness, fusionScore) are exactly the right things to do. The 3 blocking items above are the main things I would want to see resolved before merge. Happy to discuss any of them.
♻️ Duplicate comments (2)
src/ai/vectorStore.ts (1)
164-179: ⚠️ Potential issue | 🟠 Major
Validate every loaded entry before rebuilding the index.
At lines 164-179, `load()` still trusts the persisted JSON shape and only checks dimension consistency using the first vector. A malformed/corrupted file can still inject invalid entries or mixed dimensions, causing downstream runtime failures during search.
🔧 Proposed hardening for load-time validation

```diff
-    if (data.vectorDim !== undefined && data.entries.length > 0) {
-      const actualDim = data.entries[0].vector.length;
-      if (actualDim !== data.vectorDim) {
-        throw new Error(
-          `VectorStore: index vectorDim mismatch ` +
-            `(stored ${data.vectorDim}, found ${actualDim}). Index may be corrupted.`,
-        );
-      }
-    }
-
-    this.entries.clear();
-    for (const entry of data.entries) {
-      this.entries.set(entry.chunk.id, {
-        chunk: entry.chunk,
-        vector: entry.vector,
-      });
-    }
+    if (!data || !Array.isArray(data.entries)) {
+      throw new Error(
+        `VectorStore: invalid index structure in "${filePath}". Delete it to rebuild.`,
+      );
+    }
+
+    const expectedDim =
+      data.vectorDim ??
+      (data.entries.length > 0 ? data.entries[0].vector.length : undefined);
+
+    this.entries.clear();
+    for (let i = 0; i < data.entries.length; i++) {
+      const entry = data.entries[i];
+      if (
+        !entry ||
+        !entry.chunk ||
+        typeof entry.chunk.id !== "string" ||
+        typeof entry.chunk.noteId !== "string" ||
+        !Array.isArray(entry.vector)
+      ) {
+        throw new Error(
+          `VectorStore: invalid entry at index ${i} in "${filePath}". Delete it to rebuild.`,
+        );
+      }
+
+      if (expectedDim !== undefined && entry.vector.length !== expectedDim) {
+        throw new Error(
+          `VectorStore: inconsistent vector dimensions in "${filePath}" ` +
+            `(expected ${expectedDim}, found ${entry.vector.length} at entry ${i}).`,
+        );
+      }
+
+      this.entries.set(entry.chunk.id, {
+        chunk: entry.chunk,
+        vector: entry.vector,
+      });
+    }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/ai/vectorStore.ts` around lines 164 - 179, The load() path currently only checks vectorDim against the first entry and then trusts data.entries; instead, validate every entry before rebuilding this.entries: iterate over data.entries and for each entry confirm entry.chunk and entry.chunk.id exist, entry.vector is an array of numbers, and entry.vector.length matches data.vectorDim (or infer and set vectorDim if undefined) and throw a descriptive Error on mismatch; only after all entries pass validation populate this.entries (used in the for loop that calls this.entries.set) to avoid injecting malformed or mixed-dimension vectors into the index.
src/ai/__tests__/vectorStore.test.ts (1)
188-200: ⚠️ Potential issue | 🟡 Minor
Make the no-op save test prove no second write happened.
Lines 188-200 currently validate content only, so the test still passes if `save()` rewrites identical JSON on the second call.
✅ Proposed test tightening

```diff
-import { mkdtemp, rm, readFile, writeFile } from "node:fs/promises";
+import { mkdtemp, rm, readFile, writeFile, stat } from "node:fs/promises";
 @@
     it("skips write when nothing changed", async () => {
       store.add([makeChunk("n1", 0, [1, 0, 0])]);
       await store.save();
+      const before = await stat(join(tempDir, "test-index.json"));
       await store.save(); // second save should be no-op
+      const after = await stat(join(tempDir, "test-index.json"));
       const content = await readFile(
         join(tempDir, "test-index.json"),
         "utf-8",
       );
       const data = JSON.parse(content);
+      expect(after.mtimeMs).toBe(before.mtimeMs);
       expect(data.version).toBe(1);
       expect(data.entries).toHaveLength(1);
     });
```

As per coding guidelines, `**/*.test.{ts,tsx,js,jsx}`: The tests are not tautological.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/ai/__tests__/vectorStore.test.ts` around lines 188 - 200, The test "skips write when nothing changed" currently only asserts file contents, so it can pass if save() rewrites identical JSON; update the test to prove the second save did not perform a write by either (a) spying on the file write API used by VectorStore (e.g. jest.spyOn(fs.promises, "writeFile") or the actual writer used by store.save) and asserting it was called exactly once, or (b) recording the file mtime via fs.promises.stat on "test-index.json" before the second await store.save() and asserting the mtime did not change after the second save; reference the test name, store.save, and the "test-index.json" file when adding the spy/stat checks.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@src/ai/__tests__/vectorStore.test.ts`:
- Around line 188-200: The test "skips write when nothing changed" currently
only asserts file contents, so it can pass if save() rewrites identical JSON;
update the test to prove the second save did not perform a write by either (a)
spying on the file write API used by VectorStore (e.g. jest.spyOn(fs.promises,
"writeFile") or the actual writer used by store.save) and asserting it was
called exactly once, or (b) recording the file mtime via fs.promises.stat on
"test-index.json" before the second await store.save() and asserting the mtime
did not change after the second save; reference the test name, store.save, and
the "test-index.json" file when adding the spy/stat checks.
In `@src/ai/vectorStore.ts`:
- Around line 164-179: The load() path currently only checks vectorDim against
the first entry and then trusts data.entries; instead, validate every entry
before rebuilding this.entries: iterate over data.entries and for each entry
confirm entry.chunk and entry.chunk.id exist, entry.vector is an array of
numbers, and entry.vector.length matches data.vectorDim (or infer and set
vectorDim if undefined) and throw a descriptive Error on mismatch; only after
all entries pass validation populate this.entries (used in the for loop that
calls this.entries.set) to avoid injecting malformed or mixed-dimension vectors
into the index.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: b588922c-3618-4ad6-bff6-a50b6d1151d4
📒 Files selected for processing (2)
src/ai/__tests__/vectorStore.test.ts
src/ai/vectorStore.ts
Description
This PR introduces Hybrid Search to the AI module, building on top of the recent embedding pipeline by integrating keyword search and combining the results using Reciprocal Rank Fusion (RRF). This keeps searches reliable for exact keyword matches while still benefiting from semantic understanding.
Key additions:
All 77 AI-related tests are passing locally and TypeScript compilation is clean. I'm open to any feedback on the implementation or tuning parameters!
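For reviewers less familiar with RRF, the fusion step described above can be sketched in a few lines. This is a simplified illustration, assuming ranked lists of chunk ids and the conventional constant k = 60 from the RRF literature; the actual service fuses richer result objects and its k may differ:

```typescript
// Reciprocal rank fusion: each ranked list contributes 1 / (k + rank)
// per document, using 1-based ranks; scores are summed across lists.
type RankedList = string[];

function reciprocalRankFusion(
  lists: RankedList[],
  k = 60,
): Map<string, number> {
  const scores = new Map<string, number>();
  for (const list of lists) {
    list.forEach((id, index) => {
      const contribution = 1 / (k + index + 1); // index is 0-based
      scores.set(id, (scores.get(id) ?? 0) + contribution);
    });
  }
  return scores;
}

// "b" is ranked first by both lists, so it wins the fused ranking.
const fused = reciprocalRankFusion([
  ["b", "a", "c"], // e.g. semantic ranking
  ["b", "d", "a"], // e.g. keyword ranking
]);
const top = [...fused.entries()].sort((x, y) => y[1] - x[1])[0][0];
```

The appeal of RRF here is that it only consumes ranks, so cosine similarities and BM25 scores never need to be normalized onto a common scale before being combined.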
Checklist
We encourage contributors to use AI tools responsibly when creating Pull Requests. While AI can be a valuable aid, it is essential to ensure that your contributions meet the task requirements, build successfully, include relevant tests, and pass all linters. Submissions that do not meet these standards may be closed without warning to maintain the quality and integrity of the project. Please take the time to understand the changes you are proposing and their impact.
Summary by CodeRabbit
New Features
Tests
Chores